Abstract

Polynomial kernel regression is one of the standard and state-of-the-art learning strategies. However, as is well known, the choices of the degree of the polynomial kernel and the regularization parameter are still open in the realm of model selection. The first aim of this paper is to develop a strategy to select these parameters. On one hand, based on the worst-case learning rate analysis, we show that the regularization term in polynomial kernel regression is not necessary. In other words, the regularization parameter can decrease arbitrarily fast when the degree of the polynomial kernel is suitably tuned. On the other hand, taking account of the implementation of the algorithm, the regularization term is required. In summary, the only effect of the regularization term in polynomial kernel regression is to circumvent the “ill-condition” of the kernel matrix. Based on this, the second purpose of this paper is to propose a new model selection strategy, and then design an efficient learning algorithm. Both theoretical and experimental analysis show that the new strategy outperforms the previous one. Theoretically, we prove that the new learning strategy is almost optimal if the regression function is smooth. Experimentally, it is shown that the new strategy can significantly reduce the computational burden without loss of generalization capability.
I. Introduction

In many scientific fields, large amounts of data arise from sampling unknown functions. Scientists train on the data and then synthesize a function $f$ such that $f(x)$ is an efficient estimate of the output $y$ when a new input $x$ is given. The training process is usually divided into two steps. The first is to select a suitable model and the second focuses on designing an efficient learning algorithm based on the selected model. Generally speaking, the model selection strategy comprises choosing a hypothesis space, a family of parameterized functions that regulate the forms and properties of the estimator to be found, and selecting an optimization criterion, the sense in which the estimator is defined. The learning algorithm is an inference process to yield the objective estimator from a finite set of data. The central question of learning theory is how to select a feasible model and then develop an efficient algorithm such that the synthesized function can approximate the original unknown but definite function.

If kernel methods [?], [35] are used, then the model selection problem boils down to choosing a suitable kernel and the corresponding regularization parameter. After verifying the existence of the optimal kernel [?] and regularization parameter [?], there are two trends of model selection. The first is to pursue some prominent kernels, including multi-kernel learning [25], [16], hyperkernel learning [23], [24], and other kernel selection methods [?], [32], [19]. The second focuses on selecting optimal regularization parameters for some prevailing kernels, comprising the Gaussian kernel [15], [?], [?], the polynomial kernel [?], [?], [?], and other more general kernels [?], [?], [?], [36]. The topic of the current paper falls into the second category. We study the parameter selection problem in polynomial kernel regression.

Different from other widely used kernels [?], the reproducing kernel Hilbert space $\mathcal{H}_s$ of the polynomial kernel $K_s(x,y) = (1 + x\cdot y)^s$ is a finite-dimensional vector space, and its dimension depends only on $s$. Therefore, one can tune $s$ directly to control the capacity of $\mathcal{H}_s$. Using this fact, [?] found that the regularization parameter in polynomial kernel regression should decrease exponentially fast with the sample size for appropriately selected $s$. They then regarded this as an essential feature of polynomial kernel learning. The first purpose of this paper is to continue the study of [?]. Surprisingly, after a rigorous proof, we find that, as far as the learning rate is concerned, the regularization parameter can decrease arbitrarily fast for a suitable $s$. An extreme case is that in the framework of model selection, the regularization term can be omitted. This naturally raises the following question: What is the essential effect of the regularization term in polynomial kernel regression?
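To make the capacity claim concrete, the following minimal sketch computes the dimension of the space of polynomials of degree at most $s$ in $d$ variables, which is the binomial coefficient $\binom{s+d}{s}$. The function name `poly_space_dim` is hypothetical and introduced only for illustration.

```python
from math import comb


def poly_space_dim(s: int, d: int) -> int:
    """Dimension of the space of polynomials of degree <= s in d variables.

    Hypothetical helper: this is the standard count binom(s + d, s) of
    monomials x_1^{k_1} ... x_d^{k_d} with k_1 + ... + k_d <= s.
    """
    return comb(s + d, s)


# For a fixed input dimension d, the capacity grows only with the degree s.
for s in range(1, 6):
    print(s, poly_space_dim(s, d=1), poly_space_dim(s, d=3))
```

For a fixed $d$, tuning $s$ is therefore a discrete way of controlling the size of the hypothesis space.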

To answer the above question, we recall that the purpose of introducing a regularization term in kernel methods is to avoid the overfitting phenomenon [8], which is the special case in which the synthesized function fits the sample very well but fails to fit other points. However, which factor causes overfitting in the learning process is usually neglected by practitioners. Therefore, the essential role of the regularization term cannot be captured. To the best of our knowledge, there are two main causes of the overfitting phenomenon. The first is the algorithm-based factor, such as ill-condition of the kernel matrix, and the second is the model-based factor, such as too large a capacity of the hypothesis space. We find that the regularization term in polynomial kernel regression does only one job: it assures that a simple matrix-inverse technique can finish the learning task. This phenomenon is quite different from other kernel-based methods. For example, since the Gaussian RKHS is an infinite-dimensional vector space, the regularization term in Gaussian kernel regression is introduced to control both the condition number of the kernel matrix and the capacity of the hypothesis space [15].

Based on the above assertions, the second purpose of this paper is to propose a new model selection method. By the well-known representer theorem [?] in learning theory, the essential hypothesis space of polynomial kernel regression is the linear space $\mathcal{H} := \mathrm{span}\{(1 + x_1\cdot x)^s, \dots, (1 + x_m\cdot x)^s\}$. Since the algorithm-based factor is the only cause of overfitting in polynomial kernel regression, we can choose $n$ points $\{\eta_j\}_{j=1}^n$ such that the matrix $((1 + \eta_j\cdot x_i)^s)_{i,j=1}^{m,n}$ is non-singular. The set $\{\eta_j\}_{j=1}^n$ can be easily obtained. For example, we can draw $\{\eta_j\}_{j=1}^n$ identically and independently according to the uniform distribution. Then the pseudo-inverse technique [27] yields the estimator easily. In the new model, as shown in Section 4, there is only one parameter, $s$, that needs tuning. We also give an efficient strategy to select $s$ based on the theoretical analysis. Surprisingly, we find that the difficulty of model selection in our setting depends heavily on the dimension of the input space. It is shown that the higher the dimension, the easier the model selection.

Both theoretical analysis and experimental simulations are provided to illustrate the performance of the new model selection strategy. Theoretically, the new method is proved to be an almost optimal strategy if the so-called regression function is smooth. Furthermore, it is also shown that the pseudo-inverse technique can realize the almost optimality. Experimentally, both toy simulations and UCI standard data experiments imply that the new method is more efficient than the previous model selection strategy. More concretely, the new method can significantly reduce the computational burden without loss of generalization capability. The main highlight of the proposed model is that there is only one parameter (or almost none in the high-dimensional case) to be tuned in the learning process.

The rest of the paper is organized as follows. In the next section, we give a fast review of statistical learning theory and kernel methods. In Section 3, we study the model selection problem of the classical polynomial kernel regression. Section 4 describes a new model selection strategy and provides its theoretical properties. In Section 5, both toy and real-world simulation results are reported to verify the theoretical results. Section 6 is devoted to proving the main results, and Section 7 draws a simple conclusion.

II. A fast review of statistical learning theory and kernel methods

Let $X \subseteq \mathbb{R}^d$ be the input space and $Y \subseteq \mathbb{R}$ be the output space. Suppose that the unknown probability measure $\rho$ on $Z := X \times Y$ admits the decomposition

$$\rho(x,y) = \rho_X(x)\rho(y|x).$$

Let $z = (x_i, y_i)_{i=1}^m$ be a finite random sample of size $m$, $m \in \mathbb{N}$, drawn independently and identically according to $\rho$. Suppose further that $f : X \to Y$ is a function that one uses to model the correspondence between $X$ and $Y$, as induced by $\rho$. One natural measurement of the error incurred by using $f$ for this purpose is the generalization error, defined by

$$\mathcal{E}(f) := \int_Z (f(x) - y)^2 \, d\rho,$$

which is minimized by the regression function $f_\rho$, defined by

$$f_\rho(x) := \int_Y y \, d\rho(y|x).$$

We do not know this ideal minimizer $f_\rho$, since $\rho$ is unknown, but have access to random examples from $X \times Y$ sampled according to $\rho$.

Let $L^2_{\rho_X}$ be the Hilbert space of $\rho_X$ square integrable functions on $X$, with norm denoted by $\|\cdot\|_\rho$. With the assumption that $f_\rho \in L^2_{\rho_X}$, it is known that, for every $f \in L^2_{\rho_X}$, there holds

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \tag{1}$$

So, the goal of learning is to find the best approximation of the regression function $f_\rho$ within a hypothesis space $\mathcal{H}$. Let $f_{\mathcal{H}} \in \mathcal{H}$ be the best approximation of $f_\rho$, i.e., $f_{\mathcal{H}} := \arg\min_{g \in \mathcal{H}} \|f_\rho - g\|_\rho$. If there is an estimator $f_z \in \mathcal{H}$ based on the samples $z$ in hand, then we have

$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) = \|f_\rho - f_{\mathcal{H}}\|_\rho^2 + \mathcal{E}(f_z) - \mathcal{E}(f_{\mathcal{H}}). \tag{2}$$

It is well known [?], [?], [14] that a small $\mathcal{H}$ leads to a large bias $\|f_\rho - f_{\mathcal{H}}\|_\rho^2$, while a large $\mathcal{H}$ leads to a large variance $\mathcal{E}(f_z) - \mathcal{E}(f_{\mathcal{H}})$. The best hypothesis space $\mathcal{H}^*$ is obtained when the best compromise between the conflicting requirements of small bias and small variance is achieved. This is the well-known “bias-variance” dilemma of model selection.

Let $K : X \times X \to \mathbb{R}$ be continuous, symmetric and positive semidefinite, i.e., for any finite set of distinct points $\{x_1, x_2, \dots, x_m\} \subset X$, the kernel matrix $(K(x_i, x_j))_{i,j=1}^m$ is positive semidefinite. If $\mathcal{H}_K$ is the reproducing kernel Hilbert space associated with the kernel $K$, then $\mathcal{H}_K$ (see [?]) is the closure of the linear span of the set of functions $\{K_x = K(x, \cdot) : x \in X\}$ with the inner product $\langle\cdot,\cdot\rangle_K$ satisfying $\langle K_x, K_y\rangle_K = K(x,y)$ and

$$\langle K_x, f\rangle_K = f(x), \quad \forall x \in X,\ f \in \mathcal{H}_K.$$

The following Aronszajn Theorem (see [?]) describes an essential relationship between the RKHS and the reproducing kernel.

Lemma 1: Let $H$ be a separable Hilbert space of functions over $X$ with orthonormal basis $\{\phi_k\}_{k=0}^\infty$. $H$ is a reproducing kernel Hilbert space if and only if

$$\sum_{k=0}^\infty |\phi_k(x)|^2 < \infty$$

for all $x \in X$. The unique reproducing kernel $K$ is defined by

$$K(x,y) := \sum_{k=0}^\infty \phi_k(x)\phi_k(y).$$

The regularized least squares algorithm in $\mathcal{H}_K$ is defined by

$$f_{z,\lambda} := \arg\min_{f \in \mathcal{H}_K} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda\|f\|_K^2 \right\}.$$

Here $\lambda > 0$ is a constant called the regularization parameter. Usually, it depends on the sample size $m$. If the empirical error is defined by

$$\mathcal{E}_z(f) := \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2,$$


March 10, 2015 


DRAFT 




then the corresponding problem can be represented as

$$f_{z,\lambda} = \arg\min_{f \in \mathcal{H}_K} \left\{ \mathcal{E}_z(f) + \lambda\|f\|_K^2 \right\}.$$
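As an illustration of how the regularized least squares estimator is usually computed, the sketch below uses the representer theorem: the minimizer has the form $f_{z,\lambda} = \sum_{i} c_i K(x_i,\cdot)$ with $c = (\mathbb{K} + m\lambda I)^{-1} y$, where $\mathbb{K}$ is the kernel matrix. The polynomial kernel is used here only as an example, and the helper names (`poly_kernel`, `fit_regularized_ls`, `predict`) are hypothetical and not taken from the paper.

```python
import numpy as np


def poly_kernel(X, Y, s):
    """Polynomial kernel K_s(x, y) = (1 + x . y)^s for all pairs of rows."""
    return (1.0 + X @ Y.T) ** s


def fit_regularized_ls(X, y, s, lam):
    """Coefficients c solving (K + m * lam * I) c = y (representer theorem)."""
    m = X.shape[0]
    K = poly_kernel(X, X, s)
    return np.linalg.solve(K + m * lam * np.eye(m), y)


def predict(X_train, c, s, X_new):
    """Evaluate f(x) = sum_i c_i K_s(x_i, x) at the rows of X_new."""
    return poly_kernel(X_new, X_train, s) @ c


# Tiny usage example: a linear target is recovered almost exactly for small lam.
X = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
y = 2.0 * X[:, 0] + 1.0
c = fit_regularized_ls(X, y, s=1, lam=1e-8)
print(predict(X, c, 1, np.array([[0.25]])))
```

Note that the linear system has size $m \times m$, which is the source of the cubic cost in the sample size discussed later in the paper.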

III. Model selection in polynomial kernel regression 

Let $X = B^d$, $Y = [-M, M]$, where $B^d$ denotes the unit ball in $\mathbb{R}^d$, and $M < \infty$. We employ the polynomial kernel $K_s(x,y) = (1 + x\cdot y)^s$ to tackle the regression problem. From Lemma 1 it is easy to check that the RKHS of $K_s$, $\mathcal{H}_s$, coincides with the space $(\mathcal{P}_s^d, \langle\cdot,\cdot\rangle_s)$, where $\langle\cdot,\cdot\rangle_s$ is the inner product deduced from $K_s$ according to $\langle K_s(x,\cdot), K_s(\cdot,y)\rangle_s = K_s(x,y)$, and $\mathcal{P}_s^d$ is the set of algebraic polynomials of degree at most $s$.

We study the parameter selection for the following model

$$f_{z,\lambda,s} := \arg\min_{f \in \mathcal{H}_s} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda\|f\|_s^2 \right\}. \tag{3}$$

From (1), the main purpose of model selection is to yield an optimal estimate for

$$\mathcal{E}(f_{z,\lambda,s}) - \mathcal{E}(f_\rho). \tag{4}$$

The error (4) clearly depends on $z$ and therefore has a stochastic nature. As a result, it is impossible to say something about (4) in general for a fixed $z$. Instead, we can look at its behavior in expectation as measured by the expectation error

$$E_{\rho^m}(\|f_z - f_\rho\|_\rho) := \int_{Z^m} \|f_z - f_\rho\|_\rho \, d\rho^m, \tag{5}$$

where the expectation is taken over all realizations $z$ obtained for a fixed $m$, and $\rho^m$ is the $m$-fold tensor product of $\rho$. Obviously, the error (5) depends on $m$, $s$, $\lambda$ and $f_\rho$.

Recall that we do not know $\rho$, so the best we can say about it is that it lies in $\mathcal{M}(\Theta)$, where $\mathcal{M}(\Theta)$ is the class of all Borel measures $\rho$ on $Z$ such that $f_\rho \in \Theta \subseteq L^2_{\rho_X}$. We enter into a competition over all estimators $A_m : z \mapsto f_z$ and define

$$e_m(\Theta) := \inf_{A_m} \sup_{\rho \in \mathcal{M}(\Theta)} E_{\rho^m}(\|f_\rho - f_z\|_\rho^2).$$

It is easy to see that $e_m(\Theta)$ quantitatively measures the quality of $f_z$.

Now, we are in a position to discuss the model selection of polynomial kernel regression. Let $k = (k_1, k_2, \dots, k_d)$, $k_i \in \mathbb{N}$, and define the derivative

$$D^k f(x) := \frac{\partial^{|k|} f}{\partial^{k_1} x_1 \cdots \partial^{k_d} x_d},$$

where $|k| := k_1 + \cdots + k_d$. The classical Sobolev class is then defined for any $r \in \mathbb{N}$ by



Let $\pi_M t$ denote the clipped value of $t$ at $\pm M$, that is, $\pi_M t := \min\{M, |t|\}\,\mathrm{sgn}\, t$. Then it is obvious [?] that for all $t \in \mathbb{R}$ and $y \in [-M, M]$ there holds

$$\mathcal{E}(\pi_M f_{z,\lambda,s}) - \mathcal{E}(f_\rho) \le \mathcal{E}(f_{z,\lambda,s}) - \mathcal{E}(f_\rho).$$

For arbitrary $C > 0$, $\lceil C \rceil$ denotes the smallest integer not smaller than $C$. The following Theorem 1 shows the actions of the parameters in deducing the learning rate and how to select optimal parameters.

Theorem 1: Let $r \in \mathbb{N}$ and $f_{z,\lambda,s}$ be defined as in (3). Then, for arbitrary $f_\rho \in L^\infty(B^d)$, there holds



Furthermore, if $f_\rho \in W_p^r$ and $s = \lceil m^{1/(d+2r)} \rceil$, then for all $0 < \lambda \le m^{-\frac{2r}{2r+d}} (4d)^{-2s}$, there exist constants $C_1$ and $C_2$ depending only on $d$, $r$, $M$ and $p$ such that,


At first, we introduce some related work and compare it with Theorem 1. The first result, to the best of our knowledge, concerning selection of the optimal regularization parameter in the framework of learning theory belongs to [8]. Following the seminal paper [?], Cucker and Smale [8] gave a rigorous proof of the existence of the optimal regularization parameter. They showed that there is an optimal regularization parameter $\lambda > 0$ which makes the generalization error the smallest. This leads to a prevailing conception that the error estimate in Theorem 1 should have more terms containing $\lambda$, besides the term $\lambda(4d)^{2s}$. However, this is not what our result shows, which seems a contradiction at first glance. After checking the proof in [8] carefully, we find that there is nothing surprising. On one hand, the optimal parameter mentioned in [8] aims at the generalization error, containing both the learning rate and the constant $C_2$, while our result only concerns the learning rate. On the other hand, the result of [8] is more suitable to describe the performance of kernels satisfying $\|f\|_\infty \le C\|f\|_K$, where $C$ is a constant independent of $m$. However, this property does not hold for the polynomial kernel, since the only thing we can confirm is that $\|f\|_\infty \le 2^{s/2}\|f\|_s$, where $s$ depends on $m$.
After [8], a large number of selection strategies for the regularization parameter have emerged. Typical results were reported in [?], [?], [15], [30], [33], [?], [?], [36], [29], [40] and [41]. The optimal parameter may depend on the effective dimension of the marginal probability measure over the input space [5], [6], the eigenvalues of the integral operator with respect to the kernel [?], [29], or the smoothness of the regression function [15], [36]. The main difference in our work is the finding that the regularization parameter for polynomial kernel learning can decrease arbitrarily fast, an extreme case of which is that non-regularized least squares can also achieve the almost optimal learning rate. In other words, Theorem 1 shows that as far as the model selection is concerned, the choice of $s$ is much more important than the choice of $\lambda$.

For polynomial kernel learning, there are two papers [33], [41] focusing on selection of the optimal parameter. It can be easily deduced from [33] and [41] that the learning rate of the least squares regression regularized with the polynomial kernel behaves as $O(m^{-\frac{2r}{2r+d+1}})$, which is improved by Theorem 1 in the following three directions. Firstly, the learning rate analysis in Theorem 1 is based on distribution-free theory: we do not impose any assumptions on the marginal distribution $\rho_X$. Secondly, the optimal estimate is established for arbitrary $r$ ($0 < r < \infty$) rather than $0 < r \le 2$. Thirdly, Theorem 1 states that the learning rate can be improved to the almost optimal one. Therefore, as far as the learning rate is concerned, the polynomial kernel is an almost optimal choice if the smoothness information of the regression function is known.

Eberts and Steinwart [15] have already built a similar learning rate analysis for Gaussian kernel regression. It is valuable to compare the performance of Gaussian kernel regression and polynomial kernel regression. In the former, there are two parameters that need tuning. The first is the width of the Gaussian kernel and the second is the regularization parameter. Both the width and the regularization parameter are real numbers in some intervals. Thus, a wise method is to use the so-called cross-validation strategy [14, Chapter 8] to fix them, which causes tremendous computation if the sample size is large. In contrast, the kernel parameter of the polynomial kernel is a discrete quantity and our result shows that $s = \lceil m^{1/(d+2r)} \rceil$ is an almost optimal choice for arbitrary $f_\rho \in W_p^r$. Although the smoothness parameter $r$ is usually unknown in practice, Theorem 1 gives us a criterion to choose $s$. Since $s$ is discrete, and $s$ may be smaller than $\lceil m^{1/d} \rceil$, there are only $\lceil m^{1/d} \rceil$ possible values of $s$. Note that if $d$ is large, then even for very large $m$, $\lceil m^{1/d} \rceil$ is typically no larger than 10 (the most likely case is $s \le 2$ or $s \le 3$; see Section 5). Therefore, it is very easy to fix the kernel parameter through the cross-validation method. Under this circumstance, the computational burden of polynomial kernel regression can be reduced and is much less than that of Gaussian kernel regression (see Table 4 in Section 5).
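The size of the candidate set for $s$ can be checked with a few lines of arithmetic. The sketch below evaluates $\lceil m^{1/d} \rceil$ for some illustrative values of $m$ and $d$; the function name `num_degree_candidates` and the chosen values are assumptions made for this example only.

```python
import math


def num_degree_candidates(m: int, d: int) -> int:
    """Number of candidate degrees, ceil(m^(1/d)).

    Computed in floating point, so results for exact perfect powers of m
    may be sensitive to rounding.
    """
    return math.ceil(m ** (1.0 / d))


# The candidate set shrinks quickly as the input dimension d grows.
for d in (1, 2, 5, 8, 10):
    print(d, num_degree_candidates(1000, d))
```

With $m = 1000$ samples, a one-dimensional problem has 1000 candidate degrees, whereas for $d = 10$ only two remain.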

By using the well-known plug-in rule, which defines a classifier $g_{z,\lambda,s}$ of the form



(8)


Theorem 1 and [39] imply that the classifier defined as in (8) is also almost optimal if the well-known Bayes decision function satisfies a certain smoothness assumption. Therefore, $K_s$ is also one of the best choices for pattern recognition problems under this setting.

At last, we discuss the role of the regularization term in polynomial kernel regression. The purpose of introducing a regularization term in kernel learning is to overcome the overfitting phenomenon. However, the factor causing overfitting is somewhat subtle. It may be attributed to the high capacity of the hypothesis space, the ill-condition of the kernel matrix, or both. In short, there are two main factors leading to overfitting in kernel learning. The first is the model-based factor, i.e., a large capacity of the hypothesis space, and the second is the algorithm-based factor, i.e., ill-condition of the kernel matrix. In polynomial kernel regression, Theorem 1 shows that an arbitrarily small $\lambda$ (an extreme case is $\lambda = 0$) can yield the almost optimal learning rate if $s$ is appropriately tuned. Thus, the overfitting phenomenon in polynomial kernel learning is not caused by the model-based factor for suitable $s$. Recall that the kernel matrix, $A := ((1 + x_i\cdot x_j)^s)_{i,j=1}^m$, is singular if $m > \binom{s+d}{s}$, which is the most likely case in the learning process. Consequently, the simple matrix-inverse technique cannot yield the estimator directly. Thus, a regularization term is required to guarantee the non-singularity of the kernel matrix $A$. A small $\lambda$ leads to the ill-condition of the matrix $A + \lambda I$ and a large $\lambda$ leads to a large approximation error. This reflects the known tug of war between variance and bias. In short, the overfitting phenomenon of the polynomial kernel is caused by the algorithm-based factor rather than by model selection. Thus, we can design a more efficient algorithm directly instead of imposing a regularization term in the model to reduce the computational burden. This will be the main topic of the next section.
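The rank deficiency of the kernel matrix is easy to observe numerically. The following sketch, with illustrative values $m = 20$, $d = 1$, $s = 3$ and a fixed random seed (all assumptions of this example), checks that the $m \times m$ polynomial kernel matrix has rank $\binom{s+d}{s}$ and that adding $\lambda I$ restores a finite condition number.

```python
import numpy as np
from math import comb

rng = np.random.default_rng(0)
m, d, s = 20, 1, 3                       # illustrative sizes, not from the paper
X = rng.uniform(-1.0, 1.0, size=(m, d))  # sample points in the unit ball of R^1

A = (1.0 + X @ X.T) ** s                 # kernel matrix ((1 + x_i . x_j)^s)
n = comb(s + d, s)                       # dimension of the polynomial space

rank = np.linalg.matrix_rank(A)          # numerical rank, at most n
cond_plain = np.linalg.cond(A)           # huge (or inf): A is singular for m > n
cond_reg = np.linalg.cond(A + 1e-3 * np.eye(m))  # regularized matrix

print(rank, n, cond_plain, cond_reg)
```

The regularized matrix is invertible, but its conditioning degrades as $\lambda$ shrinks, which is the trade-off described above.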
IV. An efficient model selection for polynomial kernel regression


In this section, we propose a feasible model selection method for polynomial kernel regression based on the theoretical analysis in Section 3 and design a learning algorithm with low computational complexity. As analyzed above, the regularization term in the model

$$\arg\min_{f \in \mathcal{H}_s} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 + \lambda\|f\|_s^2 \right\}$$

serves to overcome the ill-condition of the kernel matrix $A$, and the model

$$\arg\min_{f \in \mathcal{H}_s} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 \right\}$$

can realize the almost optimality of regression. Both facts make it possible to select a new and more efficient model for polynomial kernel regression. Noting that for arbitrary $s$, $\mathcal{H}_s = \mathcal{P}_s^d$, we can rewrite the above optimization problem as

$$\arg\min_{f \in \mathcal{P}_s^d} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 \right\}.$$

Recalling that the dimension of $\mathcal{P}_s^d$ is $n = \binom{s+d}{s}$, we can find $\{\eta_i\}_{i=1}^n \subset B^d$ such that $\{(1 + \eta_i\cdot x)^s\}_{i=1}^n$ is a linearly independent system. Then,

$$\mathcal{P}_s^d = \left\{ \sum_{i=1}^n c_i (1 + \eta_i\cdot x)^s : c_i \in \mathbb{R} \right\} =: \mathcal{H}_{\eta,n}. \tag{9}$$

Hence, the above optimization problem can be converted to

$$\arg\min_{f \in \mathcal{H}_{\eta,n}} \left\{ \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2 \right\}. \tag{10}$$

Thus, there are two things we should do. The first is to give a selection strategy for $\{\eta_i\}_{i=1}^n$ and the second is to guarantee the non-singularity of the matrix $A_{m,n} := ((1 + x_i\cdot\eta_j)^s)_{i,j=1}^{m,n}$.

To this end, we introduce the concepts of Haar space and fundamental system [34]. Let $V \subset C(B^d)$ be an $N$-dimensional linear space. $V$ is called a Haar space of dimension $N$ if for arbitrary distinct points $x_1, \dots, x_N \in B^d$ and arbitrary $f_1, \dots, f_N \in \mathbb{R}$ there exists exactly one function $s \in V$ with $s(x_i) = f_i$, $1 \le i \le N$. The following Lemma 2 [34, Theorem 2.2] shows some important properties of Haar spaces.

Lemma 2: The following statements are equivalent. (1) $V$ is an $N$-dimensional Haar space. (2) Every $u \in V \setminus \{0\}$ has at most $N - 1$ zeros. (3) For any distinct points $x_1, \dots, x_N \in B^d$ and any basis $u_1, \dots, u_N$ of $V$, the matrix $(u_j(x_i))_{i,j=1}^N$ is non-singular.

Of course, if we can find a set of points in $B^d$, $\{\eta_i\}_{i=1}^n$, such that $\mathcal{H}_{\eta,n}$ is a Haar space of dimension $n$, then it follows from Lemma 2 that all of the above problems can be resolved. However, for $d \ge 2$, this conjecture does not hold [34, Theorem 2.3].

Lemma 3: Suppose $d \ge 2$. Then there does not exist a Haar space on $B^d$ of dimension $N \ge 2$.

Based on this, we introduce the concept of a fundamental system with respect to the polynomial kernel $K_s$.

Definition 1: Let $\zeta := \{\zeta_i\}_{i=1}^n \subset B^d$. $\zeta$ is called a $K_s$-fundamental system if

$$\dim \mathcal{H}_{\zeta,n} = \binom{s+d}{s}.$$

From the above definition, it is easy to see that an arbitrary $K_s$-fundamental system implies (9). The following Proposition 1 reveals that almost every set of $n = \binom{s+d}{s}$ points is a $K_s$-fundamental system.

Proposition 1: Let $s, n \in \mathbb{N}$ and $n = \binom{s+d}{s}$. Then the set

$$\{\zeta = (\zeta_i)_{i=1}^n : \dim \mathcal{H}_{\zeta,n} < n\}$$

has Lebesgue measure 0.

Based on Proposition 1, we can design a simple strategy to choose the centers $\{\eta_j\}_{j=1}^n$. Since the uniform distribution is absolutely continuous with respect to Lebesgue measure [?], we can draw $\{\eta_j\}_{j=1}^n$ independently and identically according to the uniform distribution. Then with probability 1, there holds

$$\mathcal{P}_s^d = \left\{ \sum_{i=1}^n c_i (1 + \eta_i\cdot x)^s : c_i \in \mathbb{R} \right\}.$$

Now we turn to proving the non-singularity of the matrix $A_{m,n}$, which follows from Proposition 2 below.


Proposition 2: Let $s, m, n \in \mathbb{N}$. If $\{x_i\}_{i=1}^m$ are i.i.d. random variables drawn according to an arbitrary distribution $\mu$, and $\{\eta_j\}_{j=1}^n$ is a $K_s$-fundamental system, then for an arbitrary vector



holds with probability at least 1 — ^ n ~, where C is a constant depending only on d. 
It can be easily deduced from Proposition 2 that, with the probability stated there, the matrix $A_{m,n}$ is non-singular. Indeed, if $A_{m,n}$ is singular, then it follows from Proposition 2 that there exists a nontrivial set $\{c_j\}_{j=1}^n$ such that



= 0 .


This implies

$$\sum_{j=1}^n c_j K_s(\eta_j, x) = 0, \quad x \in B^d,$$

which is impossible since $\{\eta_j\}$ is a $K_s$-fundamental system.

With the help of the above two propositions, we give an efficient algorithm, called efficient polynomial kernel regression (EPKR), based on the model selection strategy (10).


Algorithm 1 Efficient polynomial kernel regression (EPKR)

Input: Let $(x_i, y_i)_{i=1}^m$ be $m$ samples, $s \in \mathbb{N}$ be the degree of the polynomial kernel and $K_s(x, x') = (1 + x\cdot x')^s$ be the polynomial kernel.

Step 1: Let $n = \binom{s+d}{s}$ be the number of centers and $\{\eta_j\}_{j=1}^n$ be the set of centers, which is a $K_s$-fundamental system. $\{\eta_j\}_{j=1}^n$ can be drawn independently and identically according to the uniform distribution. Set $A_{m,n} := (K_s(x_i, \eta_j))_{i,j=1}^{m,n}$, $y = (y_1, \dots, y_m)^T$.

Step 2: Set $c = \mathrm{pinv}(A_{m,n})y = (c_1, \dots, c_n)^T$, where $\mathrm{pinv}(A_{m,n})$ denotes the pseudo-inverse operator in Matlab.

Output: $f_{z,s}(x) = \sum_{j=1}^n c_j K_s(\eta_j, x)$.
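Algorithm 1 can be sketched in a few lines. The version below is a minimal NumPy illustration rather than the authors' Matlab code: it substitutes `numpy.linalg.pinv` for Matlab's `pinv`, and the function names (`sample_unit_ball`, `epkr_fit`, `epkr_predict`) as well as the way uniform points in the unit ball are generated are assumptions of this example.

```python
import numpy as np
from math import comb


def sample_unit_ball(n, d, rng):
    """Draw n points uniformly from the unit ball in R^d."""
    g = rng.standard_normal((n, d))
    g /= np.linalg.norm(g, axis=1, keepdims=True)   # uniform directions
    radii = rng.uniform(size=(n, 1)) ** (1.0 / d)   # radius law for a uniform ball
    return g * radii


def epkr_fit(X, y, s, rng):
    """Steps 1 and 2 of Algorithm 1: random centers, then a pseudo-inverse solve."""
    d = X.shape[1]
    n = comb(s + d, s)                              # number of centers
    centers = sample_unit_ball(n, d, rng)           # K_s-fundamental system a.s.
    A = (1.0 + X @ centers.T) ** s                  # m x n matrix A_{m,n}
    c = np.linalg.pinv(A) @ y
    return centers, c


def epkr_predict(centers, c, s, X_new):
    """Output of Algorithm 1: f(x) = sum_j c_j K_s(eta_j, x)."""
    return ((1.0 + X_new @ centers.T) ** s) @ c


# Usage: a noise-free quadratic target in d = 2 is reproduced with s = 2.
rng = np.random.default_rng(0)
X = sample_unit_ball(50, 2, rng)
y = 1.0 + X[:, 0] - 2.0 * X[:, 0] * X[:, 1] + X[:, 1] ** 2
centers, c = epkr_fit(X, y, 2, rng)
print(np.max(np.abs(epkr_predict(centers, c, 2, X) - y)))
```

Only an $m \times n$ matrix is formed, so the size of the linear algebra problem is governed by $n = \binom{s+d}{s}$ rather than by $m$ alone.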


It can be seen that there is only one parameter $s$ in EPKR. To fix $s$, we can use the so-called “cross-validation” method [14, Chapter 8] or the “hold out” method [14, Chapter 7]. To be precise, we explain the latter. There are three steps to implement the “hold out” strategy: (i) splitting the sample set into two independent subsets $z_1$ and $z_2$, (ii) using $z_1$ to build the sequence $\{f_{z_1,s}\}_{s=1}^{\lceil m^{1/d} \rceil}$, and (iii) using $z_2$ to select a proper value of $s$ and thus yield the final estimator $f_z$. Noting that $s$ ranges from 1 to $\lceil m^{1/d} \rceil$, if $d$ is large, then $\lceil m^{1/d} \rceil$ is always a small value. The following Theorem 2 illustrates the generalization capability of EPKR.
Theorem 2: Let $r \in \mathbb{N}$, $f_\rho \in W_p^r$, and $f_z$ be the EPKR estimator. Then,

$$C_1 m^{-\frac{2r}{2r+d}} \le e_m(W_p^r) \le \sup_{\rho \in \mathcal{M}(W_p^r)} E_{\rho^m}(\|f_\rho - \pi_M f_z\|_\rho^2) \le C_2 m^{-\frac{2r}{2r+d}} \log m.$$

Theorem 2 shows that the selected model (10) is an almost optimal choice if the smoothness information of the regression function is known. Furthermore, the pseudo-inverse technique is sufficient to realize the almost optimality of the model (10). In addition, it can be easily deduced that the computational complexity of EPKR is very small compared to the classical polynomial kernel regression method (3). Indeed, for fixed $s$ and $n = \binom{s+d}{s}$, the computational complexity is $mn^2$, while that of (3) is $m^3$.
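The “hold out” selection of $s$ described before Theorem 2 can be sketched as follows. This is a minimal illustration under stated assumptions: the sample is split into two halves, the centers are drawn uniformly from the cube $[-1,1]^d$ for brevity (which coincides with the unit ball only when $d = 1$), and the names `design` and `holdout_select_degree` as well as the optional cap `s_max` are hypothetical.

```python
import math
import numpy as np
from math import comb


def design(X, centers, s):
    """Matrix with entries (1 + x_i . eta_j)^s."""
    return (1.0 + X @ centers.T) ** s


def holdout_select_degree(X, y, rng, s_max=None):
    """Pick s by fitting on the first half and validating on the second half."""
    m, d = X.shape
    if s_max is None:
        s_max = math.ceil(m ** (1.0 / d))          # candidate range 1..ceil(m^(1/d))
    half = m // 2
    X1, y1, X2, y2 = X[:half], y[:half], X[half:], y[half:]
    best_s, best_err = None, math.inf
    for s in range(1, s_max + 1):
        n = comb(s + d, s)
        centers = rng.uniform(-1.0, 1.0, size=(n, d))  # assumption: cube, not ball
        c = np.linalg.pinv(design(X1, centers, s)) @ y1
        err = float(np.mean((design(X2, centers, s) @ c - y2) ** 2))
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err


# Usage: a noise-free cubic target in d = 1 requires degree at least 3.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = X[:, 0] ** 3 - X[:, 0]
print(holdout_select_degree(X, y, rng, s_max=8))
```

Each candidate costs one pseudo-inverse of an $m/2 \times n$ matrix, so the whole search stays cheap when the candidate range is short.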


V. Experimental results 

In this section, we give both toy and UCI standard data simulations of the model selection strategy for polynomial kernel regression and the EPKR algorithm. All the numerical simulations are carried out in the Matlab R2011b environment running on Windows 7 with an Intel(R) Core(TM) i7-3770K CPU @ 3.50 GHz.

A. Toy simulation 

1) Experimental setting: In this part, we introduce the simulation setting of the toy experiment. 

Method choices: In the toy simulation, four methods are employed. The first one is Gaussian kernel regression; the second one is the classical polynomial kernel regression; the third one is the efficient polynomial kernel regression (EPKR) whose centers $\{\eta_j\}_{j=1}^n$ are drawn independently and identically according to the uniform distribution; the last one is the EPKR whose centers $\{\eta_j\}_{j=1}^n$ are the first $n$ points in the sample data $x$.

Samples: In the simulation, the training samples are generated as follows. Let $f(t) = (1 - 2t)_+^3 (32t^2 + 10t + 1)$, where $t \in [0,1]$ and $a_+ = \max\{a, 0\}$. Then it is easy to see that $f$ has only finite Sobolev smoothness on $[0,1]$. Let $x = \{x_i\}_{i=1}^m$ be drawn independently and identically according to the uniform distribution with $m = 1000$ and $y = \{y_i\}_{i=1}^m$ with $y_i = f(x_i) + \delta_i$, where the noise terms $\{\delta_i\}_{i=1}^m$ are drawn independently and identically according to the Gaussian distribution $\mathcal{N}(0, \sigma^2)$ with $\sigma^2 = 0.1$. The test samples are generated as follows: $x' = \{x'_i\}_{i=1}^{m'}$ are drawn independently and identically according to the uniform distribution with $m' = 1000$ and $y'_i = f(x'_i)$. In the numerical experiment, TestRMSE is the mean root mean square error (RMSE) of the testing data over 10 simulations. TrainRMSE is the mean RMSE of the training data over 10 simulations. TrainMT denotes the mean training time over 10 simulations, and TestMT denotes the mean testing time over 10 simulations.
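The data-generating process of the toy experiment can be written down directly. The sketch below uses only the Python standard library; the function names (`target`, `make_toy_data`, `rmse`) and the random seed are assumptions of this example, and the uniform distribution is taken to be on $[0,1]$.

```python
import math
import random


def target(t: float) -> float:
    """f(t) = (1 - 2t)_+^3 * (32 t^2 + 10 t + 1) on [0, 1]."""
    return max(1.0 - 2.0 * t, 0.0) ** 3 * (32.0 * t * t + 10.0 * t + 1.0)


def make_toy_data(m: int, noise_var: float, rng: random.Random):
    """Uniform inputs on [0, 1] with Gaussian noise of variance noise_var."""
    xs = [rng.random() for _ in range(m)]
    sigma = math.sqrt(noise_var)
    ys = [target(x) + rng.gauss(0.0, sigma) for x in xs]
    return xs, ys


def rmse(pred, truth) -> float:
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


rng = random.Random(0)
x_train, y_train = make_toy_data(1000, 0.1, rng)   # noisy training sample
x_test, y_test = make_toy_data(1000, 0.0, rng)     # noise-free test sample
print(len(x_train), len(x_test))
```

Setting the noise variance to zero reproduces the noise-free test outputs $y'_i = f(x'_i)$ used for TestRMSE.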

2) Simulation results: In the first simulation, we study the action of the regularization term in the classical polynomial kernel regression (3). In the left figure of Fig. 1, it can be seen that there exists an optimal $\lambda$ minimizing the TestRMSE for the optimally selected $s$. This only means that introducing the penalty in (3) can avoid overfitting. Recalling Theorem 1, the action of the regularization term in (3) is to avoid the ill-condition of the kernel matrix. Thus, more simulations are required. To this end, we introduce the coefficient-based regularization EPKR (CBR EPKR). The CBR EPKR is the algorithm which uses

$$c = \mathrm{pinv}(A_{m,n} + \lambda I_{m,n})\, y$$

instead of Step 2 in the EPKR algorithm, where $I_{m,n} = (a_{i,j})_{i,j=1}^{m,n}$ is the matrix with $a_{i,i} = 1$ and $a_{i,j} = 0$ for $i \ne j$. For $\lambda = 0$, the coefficient-based regularization EPKR algorithm coincides with EPKR. If the overfitting phenomenon were caused by the model-based factor, i.e., if the hypothesis space of (3) or (10) were too large, then there would exist an optimal $\lambda > 0$ in the middle figure of Fig. 1, which is not observed in Fig. 1. Indeed, the TestRMSE is a monotonically increasing function of the regularization parameter $\lambda$. This means that the capacity of the hypothesis space of (3) is not too large and is suitable for the learning task. Thus, we can draw the conclusion from Fig. 1 that the essential effect of the penalty in (3) is to overcome the ill-condition of the kernel matrix.
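The coefficient-based regularized variant only changes Step 2 of EPKR. A minimal NumPy sketch, assuming `numpy.linalg.pinv` in place of Matlab's `pinv` and using the hypothetical name `cbr_epkr_coefficients`, is:

```python
import numpy as np


def cbr_epkr_coefficients(A, y, lam):
    """Coefficients c = pinv(A + lam * I_{m,n}) y for an m x n matrix A.

    I_{m,n} is the rectangular matrix with ones on the main diagonal and
    zeros elsewhere, as returned by np.eye(m, n).
    """
    m, n = A.shape
    I_mn = np.eye(m, n)
    return np.linalg.pinv(A + lam * I_mn) @ y


# Usage on a small random design matrix (illustrative values only).
rng = np.random.default_rng(0)
A = rng.uniform(0.0, 2.0, size=(6, 3))
y = rng.standard_normal(6)
print(cbr_epkr_coefficients(A, y, 0.0))   # lam = 0 reduces to plain EPKR
print(cbr_epkr_coefficients(A, y, 0.5))
```

Sweeping `lam` over a grid and recording the test error is one way to reproduce the kind of curve discussed for the middle panel of Fig. 1.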


Readers can find an interesting phenomenon in Fig. 1. There is a $\lambda_1$ in the middle and right figures of Fig. 1 such that for $0 < \lambda \le \lambda_1$, the TestRMSE is a linear function of $\lambda$, while for $\lambda > \lambda_1$, the slope decreases. We give a simple explanation of this phenomenon. The generalization error can be decomposed into approximation error and sample error. It is obvious that the approximation error is a linear function of the regularization parameter $\lambda$. However, the relation between the sample error and $\lambda$ is more complicated. It is easy to see that the hypothesis space belongs to the set



Roughly speaking, if $\lambda \le \lambda_1$, the covering number of the hypothesis space is larger than the quantity appearing in Lemma 5 for a fixed $\varepsilon$. When $\lambda$ increases to $\lambda_1$, the covering number of the hypothesis space decreases. Once the covering number of the hypothesis space is strictly smaller than the mentioned quantity, the sample error decreases with respect to $\lambda$. Thus, the sum of the approximation and sample errors is not a linear function of $\lambda$ and the slope decreases as $\lambda$ grows.

Fig. 1. The left figure shows the relation between test error and $\lambda$ of model (3). The middle one illustrates the relation between test error and $\lambda$ of coefficient-based EPKR and the right one is a detailed view.

In the next simulation, we study the importance of $s$ in both model (3) and model (10). Based
on the results in Fig. 1, in the upper left figure of Fig. 2 we study the relation between TestRMSE
and $s$ for model (3), where $\lambda$ is the optimal value among 50 candidates equally spaced in
$[10^{-5}, 1]$. It can be seen that there exists an optimal $s$ minimizing TestRMSE. Since $f_\rho \in
W^{r}([0,1])$, it follows from Theorem 1 that the optimal $s$ may be close to the value $\lceil m^{1/(2r+d)} \rceil = 4$.
The upper right figure shows that the optimal $s$ of (3) is 7 in our setting. The lower figures
depict the relation between TestRMSE and $s$ for EPKR. It can be seen in both of the lower figures
of Fig. 2 that there is an optimal $s$ minimizing TestRMSE and that the optimal value of $s$ is 5, which
also agrees with the theoretical analysis in Theorem 2.
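As a quick check of the arithmetic behind the predicted degree, the sketch below evaluates $\lceil m^{1/(2r+d)} \rceil$. The values $m = 1000$, $r = 2$ and $d = 1$ are assumptions read off from the toy setting (1000 training samples, a regression function of smoothness 2 on $[0,1]$), and the function name is hypothetical.

```python
import math


def predicted_degree(m: int, r: float, d: int) -> int:
    """Return ceil(m ** (1 / (2r + d))), the degree suggested by the rate analysis."""
    return math.ceil(m ** (1.0 / (2 * r + d)))


# Assumed toy setting: m = 1000 samples, smoothness r = 2, input dimension d = 1.
s_star = predicted_degree(1000, 2, 1)
print(s_star)
```

The empirically optimal degrees reported above (7 for model (3), 5 for EPKR) are of the same order as this prediction, which is all the rate analysis can promise.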

In the third simulation, we study the effect of the choice of $\eta$ in EPKR. We compare the
following four methods of choosing $\eta$ in (10). EPKR denotes that $\eta = \{\eta_i\}_{i=1}^n$ are drawn i.i.d.
according to the uniform distribution. EPKR1 denotes that $\{\eta_i\}_{i=1}^n$ are selected as the first $n$
elements of the samples. EPKRF denotes that $\{\eta_i\}_{i=1}^n$ are chosen as $n$ equally spaced points in
$[0,1]$. EPKRG denotes that $\{\eta_i\}_{i=1}^n$ are generated i.i.d. according to the Gaussian distribution
$\mathcal{N}(1/2, 1)$. It can be seen in Fig. 3 that, for suitable $s$, the choice of $\eta$ does not affect the
learning capability. This verifies the theoretical result of Proposition 1.

Fig. 2. The upper left figure shows the relation between the test error and $s$ for model (3), while the upper right figure
shows it in detail. The lower left figure illustrates the relation between the test error and $s$ for EPKR, while the lower right
figure shows it in detail.
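The four ways of choosing the centers compared in this simulation can be sketched as follows for $d = 1$. This is only an illustration: the function name, the seed and the synthetic `samples` list are hypothetical, and the equally spaced points are assumed to include both endpoints of $[0,1]$.

```python
import random


def draw_centers(scheme, n, samples, rng):
    """Return n centers in the style of EPKR, EPKR1, EPKRF or EPKRG (d = 1, n >= 2)."""
    if scheme == "EPKR":    # i.i.d. uniform on [0, 1]
        return [rng.random() for _ in range(n)]
    if scheme == "EPKR1":   # first n inputs of the training sample
        return list(samples[:n])
    if scheme == "EPKRF":   # n equally spaced points in [0, 1], endpoints included
        return [i / (n - 1) for i in range(n)]
    if scheme == "EPKRG":   # i.i.d. Gaussian with mean 1/2 and unit variance
        return [rng.gauss(0.5, 1.0) for _ in range(n)]
    raise ValueError(f"unknown scheme: {scheme}")


rng = random.Random(0)
samples = [rng.random() for _ in range(1000)]  # hypothetical training inputs
centers = {name: draw_centers(name, 5, samples, rng)
           for name in ("EPKR", "EPKR1", "EPKRF", "EPKRG")}
```

Note that the Gaussian scheme can place centers outside $[0,1]$, which makes it a useful stress test of the claim that the choice of centers is immaterial.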

In the last simulation, we study the learning capabilities of four methods: classical polynomial
kernel regression (3), Gaussian kernel regression [15, Eqs. (4)], EPKR and EPKR1. Fig. 4 shows
that the functions learned by all of these methods are almost the same.
Since both the Gaussian kernel and the polynomial kernel are infinitely smooth functions while the
regression function has smoothness at most 2, none of them can approximate the regression
function within a very small tolerance. This agrees with the lower bounds of Theorem 1 and
Theorem 2.

Table 1 shows a quantitative comparison among the aforementioned methods. All of them
attain a similar TestRMSE. However, since there are two parameters in
Gaussian kernel regression and classical polynomial kernel regression, a large amount of computation
is required to select a suitable model, i.e., to tune $\lambda_G, \delta$ in Gaussian kernel regression
and $\lambda_P, s$ in classical polynomial kernel regression. In this simulation, we use three-fold
cross-validation [14, Chapter 8] to choose these parameters. We choose 50 candidates of $\lambda_G$
and $\lambda_P$ as $\{10^{-5}, 10^{-5} + 10^{-2}, \dots, 10^{-5} + 49 \times 10^{-2}\}$, 50 candidates of $s$ as $\{1, 2, \dots, 50\}$, and 40
candidates of $\delta$ as $\{0.01, 0.01 + 0.025, \dots, 0.01 + 39 \times 0.025\}$. EPKR and EPKR1 have only one
parameter, $s$. We also use three-fold cross-validation to choose the optimal
$s$ from $\{1, \dots, 50\}$. Table 1 shows that the training times of EPKR and EPKR1 are
much smaller than those of the other two methods, for two reasons.
On one hand, only one parameter needs tuning in EPKR. On the
other hand, the computational complexity of EPKR is $O(mn^2)$, which is smaller than $O(m^3)$ for
small $s$. Since the deduced EPKR (or EPKR1) estimator is a linear combination of $n = 5$
basis functions, while those of Gaussian kernel regression and polynomial kernel regression use
1000, the test time of EPKR and EPKR1 is also much smaller than that of the other two methods.

Fig. 3. The left figure compares the learning capabilities of EPKR, EPKR1, EPKRF and EPKRG. The others
show it in detail.
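The gap in model-selection cost can be made concrete by counting fits. The sketch below assumes an exhaustive grid search with three-fold cross-validation over the candidate sets listed above, and treats $O(m^3)$ and $O(mn^2)$ as exact operation counts with unit constants, which is a simplification; the helper name is hypothetical.

```python
def cv_fits(grid_sizes, folds=3):
    """Number of model fits needed by an exhaustive grid search with k-fold CV."""
    total = folds
    for size in grid_sizes:
        total *= size
    return total


gaussian_fits = cv_fits([50, 40])    # 50 values of lambda x 40 kernel widths
polynomial_fits = cv_fits([50, 50])  # 50 values of lambda x 50 degrees
epkr_fits = cv_fits([50])            # 50 degrees only

# Rough per-fit cost ratio m^3 / (m n^2) for m = 1000 samples and n = 5 centers.
m, n = 1000, 5
cost_ratio = m**3 / (m * n**2)
print(gaussian_fits, polynomial_fits, epkr_fits, cost_ratio)
```

Under these assumptions EPKR needs 40 to 50 times fewer fits, and each fit is cheaper by several orders of magnitude, which is in line with the training times in Table 1.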
Fig. 4. The upper left figure shows the training samples and the functions learned by the aforementioned four learning
strategies. The other three figures show it in detail.


TABLE 1

Methods      TestRMSE   Optimal parameter                        TrainMT (second)   TestMT (second)
Gaussian     0.0090     $\lambda = 10^{-0.6}$, $\delta = 0.06$   172.31             7.965
Polynomial   0.0097     $\lambda = 10^{-4.2}$, $s = 14$          185.59             11.466
EPKR         0.0097     $s = 9$                                  0.214              0.0428
EPKR1        0.0097     $s = 8$                                  0.254              0.0383


B. UCI data 

1) Experimental setting: In this part, we introduce the simulation setting of the UCI data
experiment. All the data are taken from http://www.niaad.liacc.up.pt/~ltorgo/Regression/ds_menu.html.

Method choices: In the UCI data experiment, we compare four methods, namely support
vector machine (SVM) [35], Gaussian kernel regression (GKR) [15, Eqs. (4)], classical polynomial
kernel regression (3) (PKR) and EPKR, on 9 real-world benchmark data sets covering various
fields. We use three-fold cross-validation to select the parameters of the aforementioned methods
among 40 candidates of the width of the Gaussian kernel and 50 candidates of the regularization
parameter $\lambda$. However, owing to the theoretical analysis in Theorem 1, the polynomial kernel
parameter $s$ has only the candidates $\{1, \dots, \lceil m^{1/d} \rceil\}$. The centers of EPKR are drawn
i.i.d. according to the uniform distribution on $[0,1]$.

Samples: The training and testing samples are drawn according to the following Table 2. 

TABLE II

Specification of real world benchmark data sets

Data sets            Train Number   Test Number   #Attributes
Auto_price           106            53            15
Boston(housing)      337            169           13
Stock                633            317           9
Abalone              2785           1392          8
Bank8FM              2999           1500          8
Delta_ailerons       3565           3564          5
Computer activity    4096           4096          21
Delta_Elevators      4759           4758          6
California housing   10320          10320         8


2) Experimental results: As shown in Table 3, the TrainRMSE and TestRMSE of all the
mentioned methods are similar. As far as the TrainMT is concerned, however, Table 4 shows
that EPKR outperforms the others. Table 4 also shows that the TrainMT of
PKR is smaller than that of GKR and SVM. This is because we use the theoretical result in Theorem
1 to select the kernel parameter $s$: Theorem 1 shows that it suffices to select $s$ in
the set $\{1, \dots, \lceil m^{1/d} \rceil\}$, which reduces the difficulty of model selection for PKR. Since the
TestMT depends heavily on the sparsity of the estimator, we also compare the sparsity
of the mentioned methods in Table 4. In short, as far as the generalization capability is
concerned, all of these methods are of high quality. However, as far as the computational burden
is concerned, EPKR is superior to the others. Furthermore, unlike classical polynomial
kernel regression, EPKR can produce sparse estimators. In addition, by our theoretical analysis,
the computational burden of PKR can be heavily reduced.
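To see how much the restricted candidate set shrinks the search for $s$, the sketch below evaluates an upper limit for each data set of Table II. It assumes that the limit is $\lceil m^{1/d} \rceil$, with $m$ the number of training samples and $d$ the number of attributes; this reading of the limit, and the helper name, are assumptions rather than statements taken from the tables.

```python
import math

# (training size m, number of attributes d), copied from Table II.
datasets = {
    "Auto_price": (106, 15),
    "Boston(housing)": (337, 13),
    "Stock": (633, 9),
    "Abalone": (2785, 8),
    "Bank8FM": (2999, 8),
    "Delta_ailerons": (3565, 5),
    "Computer activity": (4096, 21),
    "Delta_Elevators": (4759, 6),
    "California housing": (10320, 8),
}


def max_degree(m: int, d: int) -> int:
    """Assumed upper limit ceil(m ** (1/d)) of the candidate set {1, ..., limit}."""
    return math.ceil(m ** (1.0 / d))


limits = {name: max_degree(m, d) for name, (m, d) in datasets.items()}
for name, limit in limits.items():
    print(f"{name}: s in 1..{limit}")
```

Under this assumption no data set needs more than six candidate degrees, compared with the 50 candidates used for $s$ in the toy simulation.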
TABLE III

                     TrainRMSE                           TestRMSE
Data sets            SVM      GKR      PKR      EPKR     SVM      GKR      PKR      EPKR
Auto_price           0.0674   0.019    0.0598   0.0713   0.0914   0.1132   0.0894   0.0968
Boston(housing)      0.0683   0.0114   0.0668   0.0881   0.0852   0.1201   0.0748   0.0998
Stock                0.0491   0.0141   0.023    0.0241   0.0503   0.0286   0.0327   0.0365
Abalone              0.0750   0.0716   0.0735   0.0744   0.0792   0.0759   0.0748   0.0753
Bank8FM              0.0446   0.0359   0.0367   0.0371   0.0458   0.0422   0.045    0.0475
Delta_ailerons       0.0417   0.0369   0.037    0.0376   0.0422   0.0388   0.0392   0.039
Computer activity    0.0445   0.0221   0.0282   0.0259   0.0463   0.0261   0.03     0.0337
Delta_Elevators      0.0526   0.0526   0.0527   0.0532   0.0542   0.0532   0.0532   0.0534
California housing   0.0734   0.0575   0.0819   0.0611   0.072    0.0625   0.0832   0.0696


TABLE IV

                     TestMT                                      Mean sparsity
Data sets            SVM        GKR       PKR        EPKR        SVM     GKR    PKR    EPKR
Auto_price           26.09272   2.624     0.1147     0.0111      22.9    71     71     16
Boston(housing)      11.3646    78.901    2.6658     0.0178      56.35   225    225    41.3
Stock                30.4451    25.057    0.9929     0.0814      22.9    422    422    154
Abalone              687.021    790.352   31.8664    0.1693      423.2   1857   1857   45
Bank8FM              660.301    974.147   39.1042    0.187       84.25   2000   2000   153
Delta_ailerons       291.488    1421.4    114.9805   1.2169      99      2377   2377   56
Computer activity    723.985    2069.3    53.8626    0.2684      80      2731   2731   253
Delta_Elevators      882.53     2988.9    198.9652   1.305       292     3173   3173   49.7
California housing   8489.09    31031     1469.6     3.0532      924     7595   7595   75


VI. Proofs 

To prove Theorem 1, we need the following four lemmas. The first one, a concentration
inequality, can be found in [13, Lemma 3.2].

Lemma 4: Let $\mathcal{F}$ be a class of functions that are all bounded by $M$. For all $m$ and $\alpha, \beta > 0$,
we have
\[
P_{\rho^m}\Big\{ \exists f \in \mathcal{F} : \|f - f_\rho\|_\rho^2 > 2\big(\|y - f\|_m^2 - \|y - f_\rho\|_m^2\big) + \alpha + \beta \Big\}
\le 14 \sup_{\mathbf{x}} \mathcal{N}\Big(\frac{\beta}{40M}, \mathcal{F}, L_1(\nu_{\mathbf{x}})\Big) \exp\Big(-\frac{\alpha m}{2568 M^4}\Big),
\]
where $\mathbf{x} = (x_1, \dots, x_m) \in X^m$ and $\mathcal{N}(\beta/(40M), \mathcal{F}, L_1(\nu_{\mathbf{x}}))$ is the covering number for the class $\mathcal{F}$ by
balls of radius $\beta/(40M)$ in $L_1(\nu_{\mathbf{x}})$, with $\nu_{\mathbf{x}} = \frac{1}{m}\sum_{i=1}^m \delta_{x_i}$ the empirical discrete measure.

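The second one, a covering number estimate, is deduced from [14, Chapter 9].

Lemma 5: Let $\pi_M \mathcal{H}_s := \{\pi_M f : f \in \mathcal{H}_s\}$. Then,
\[
\mathcal{N}(\varepsilon, \pi_M \mathcal{H}_s, L_1(\nu_{\mathbf{x}})) \le 3 \left( \frac{2eM^p}{\varepsilon^p} \log \frac{3eM^p}{\varepsilon^p} \right)^{\binom{s+d}{d}}.
\]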


The third one presents an estimate of the minimal eigenvalue of a matrix generated by the polynomial
kernel, which can be found in [20, Theorem 20].

Lemma 6: Let $s \in \mathbb{N}$, $n = \binom{s+d}{d}$, $S^{d-1}$ be the unit sphere in $\mathbb{R}^d$, and $\{\xi_i\}_{i=1}^n \subset S^{d-1}$. Then
the minimal eigenvalue of the matrix $A_s := \big((1 + \xi_i \cdot \xi_j)^s\big)_{i,j=1}^n$, $\mu_{\min}(A_s)$, satisfies
\[
\mu_{\min}(A_s) \ge \frac{s!\,\Gamma(d/2)}{2^s\,\Gamma(s + d/2)}.
\]

To state the last lemma, we introduce the best approximation operator. A function
$\eta$ is said to be admissible [26] if $\eta \in C^\infty[0, \infty)$, $\eta(t) \ge 0$, and
\[
\operatorname{supp} \eta \subset [0,2], \quad \eta(t) = 1 \text{ on } [0,1], \quad \text{and } 0 \le \eta(t) \le 1 \text{ on } [1,2].
\]


Such a function can be easily constructed out of an orthogonal wavelet mask [10]. Let
\[
h_k := \frac{\pi^{1/2}\,\Gamma(d+k)\,\Gamma((d+1)/2)}{(k + d/2)\,k!\,\Gamma(d/2)\,\Gamma(d)}.
\]
Define
\[
U_k := (h_k)^{-1/2} G_k^{d/2}, \quad k = 0, 1, \dots, \tag{11}
\]
where $G_k^{\mu}$ is the well known Gegenbauer polynomial of order $\mu$ [26]. The best approximation
kernel is defined by

L s (x, y) ■= J2 V (If) v l \ U k {x ■ £)U k (y ■ i)duj d - 1 ( 0 > 
fc =o \ 2s J Jsd ~ 1 

where $d\omega_{d-1}$ stands for the area element of $S^{d-1}$. It can be easily deduced from [26] that
\[
|L_s(x, y)| \le C s^d, \quad x, y \in B^d. \tag{12}
\]


Let
\[
E_s(f)_p := \inf_{P \in \mathcal{P}_s^d} \|f - P\|_{L^p(B^d)}
\]
be the best approximation error of $\mathcal{P}_s^d$. Define
\[
\mathcal{L}_s f(x) := \int_{B^d} L_s(x, y) f(y)\,dy. \tag{13}
\]
It is obvious that $\mathcal{L}_s f \in \mathcal{P}_s^d$. The following Lemma 7, which can be found in [?, Section 3], shows
the best approximation property of $\mathcal{L}_s f$.

Lemma 7: Let $1 \le p \le \infty$, and let $\mathcal{L}_s$ be defined in (13). Then for arbitrary $f \in L^p(B^d)$, there
exists a constant $C$ depending only on $d$ and $p$ such that
\[
\|f - \mathcal{L}_s f\|_{L^p(B^d)} \le C E_{[s/2]}(f)_p.
\]


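Now we give the proof of Theorem 1.

Proof of Theorem 1: We write
\[
\|\pi_M f_{z,\lambda,s} - f_\rho\|_\rho^2 = T_1(z, \lambda, s) + T_2(z, \lambda, s),
\]
where
\[
T_1(z, \lambda, s) := \|\pi_M f_{z,\lambda,s} - f_\rho\|_\rho^2 - 2\big(\|y - \pi_M f_{z,\lambda,s}\|_m^2 - \|y - f_\rho\|_m^2\big)
\]
and
\[
T_2(z, \lambda, s) := 2\big(\|y - \pi_M f_{z,\lambda,s}\|_m^2 - \|y - f_\rho\|_m^2\big).
\]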

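To bound $T_1(z, \lambda, s)$, we use Lemma 4 and Lemma 5 and obtain
\[
\begin{aligned}
P_{\rho^m}\{T_1(z, \lambda, s) > u\}
&= P_{\rho^m}\Big\{ \|\pi_M f_{z,\lambda,s} - f_\rho\|_\rho^2 > 2\big(\|y - \pi_M f_{z,\lambda,s}\|_m^2 - \|y - f_\rho\|_m^2\big) + \frac{u}{2} + \frac{u}{2} \Big\} \\
&\le P_{\rho^m}\Big\{ \exists f \in \pi_M \mathcal{H}_s : \|f - f_\rho\|_\rho^2 > 2\big(\|y - f\|_m^2 - \|y - f_\rho\|_m^2\big) + \frac{u}{2} + \frac{u}{2} \Big\} \\
&\le 14 \sup_{\mathbf{x}} \mathcal{N}\Big(\frac{u}{80M}, \pi_M \mathcal{H}_s, L_1(\nu_{\mathbf{x}})\Big) \exp\Big(-\frac{um}{5136 M^4}\Big) \\
&\le 42 \left( \frac{160 e M^2}{u} \log \frac{240 e M^2}{u} \right)^{\binom{s+d}{d}} \exp\Big(-\frac{um}{5136 M^4}\Big).
\end{aligned}
\]
For arbitrary $u \ge 160 e M^4/m$, we then obtain
\[
P_{\rho^m}\{T_1(z, \lambda, s) > u\} \le 42\,(mM^2)^{2\binom{s+d}{d}} \exp\Big(-\frac{um}{5136 M^4}\Big).
\]
Then, we get, for any $v \ge 160 e M^4/m$,
\[
\begin{aligned}
E_{\rho^m}\{T_1(z, \lambda, s)\}
&\le v + \int_v^\infty P_{\rho^m}\{T_1(z, \lambda, s) > t\}\,dt \\
&\le v + \int_v^\infty 42\,(mM^2)^{2\binom{s+d}{d}} \exp\Big(-\frac{tm}{5136 M^4}\Big)\,dt \\
&\le v + 42\,(mM^2)^{2\binom{s+d}{d}}\,\frac{5136 M^4}{m} \exp\Big(-\frac{vm}{5136 M^4}\Big).
\end{aligned}
\]
Setting
\[
v = \frac{5136 M^4}{m} \log\Big(42\,(mM^2)^{2\binom{s+d}{d}}\Big),
\]
we have
\[
E_{\rho^m}\{T_1(z, \lambda, s)\} \le \frac{5136 M^4}{m}\Big(\log\Big(42\,(mM^2)^{2\binom{s+d}{d}}\Big) + 1\Big). \tag{14}
\]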
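The last two steps of this bound are elementary. As a worked check, write $A := 42\,(mM^2)^{2\binom{s+d}{d}}$ and $c := 5136 M^4$; this shorthand is introduced here only for readability and does not appear elsewhere in the paper.

\[
\int_v^\infty A e^{-tm/c}\,dt = \frac{Ac}{m}\,e^{-vm/c},
\qquad
v = \frac{c}{m}\log A \;\Longrightarrow\; e^{-vm/c} = \frac{1}{A},
\]
\[
v + \frac{Ac}{m}\,e^{-vm/c} = \frac{c}{m}\log A + \frac{c}{m} = \frac{c}{m}\big(\log A + 1\big).
\]

Since $\log A = \log 42 + 2\binom{s+d}{d}\log(mM^2)$, the resulting bound on the expectation of $T_1$ is of order $\binom{s+d}{d}\log m / m$ for fixed $M$.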

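Now we turn to bound $T_2(z, \lambda, s)$. It follows from the definition of the truncation operator
$\pi_M$ and $f_{z,\lambda,s}$ that
\[
\begin{aligned}
T_2(z, \lambda, s) &= 2\big(\|y - \pi_M f_{z,\lambda,s}\|_m^2 - \|y - f_\rho\|_m^2\big) \\
&\le 2\big(\|y - f_{z,\lambda,s}\|_m^2 - \|y - f_\rho\|_m^2\big) \\
&\le 2\big(\|y - f_{z,\lambda,s}\|_m^2 + \lambda\|f_{z,\lambda,s}\|_s^2 - \|y - f_\rho\|_m^2\big) \\
&\le 2\big(\|y - \mathcal{L}_s f_\rho\|_m^2 + \lambda\|\mathcal{L}_s f_\rho\|_s^2 - \|y - f_\rho\|_m^2\big).
\end{aligned}
\]
Therefore, the definition of $f_\rho$ yields that
\[
E_{\rho^m}\{T_2(z, \lambda, s)\} \le 2\big(\mathcal{E}(\mathcal{L}_s f_\rho) - \mathcal{E}(f_\rho) + \lambda\|\mathcal{L}_s f_\rho\|_s^2\big).
\]
Then, the identity $\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2$ yields that
\[
E_{\rho^m}\{T_2(z, \lambda, s)\} \le 2\|\mathcal{L}_s f_\rho - f_\rho\|_\rho^2 + 2\lambda\|\mathcal{L}_s f_\rho\|_s^2.
\]
Since $f_\rho \in L^\infty(B^d)$, Lemma 7 implies
\[
E_{\rho^m}\{T_2(z, \lambda, s)\} \le C\big(E_{[s/2]}(f_\rho)_\infty\big)^2 + 2\lambda\|\mathcal{L}_s f_\rho\|_s^2, \tag{15}
\]
where $C$ is a constant depending only on $d$. It remains to bound $\|\mathcal{L}_s f_\rho\|_s^2$. To
this end, let $x_0 \in B^d$ satisfy
\[
\|L_s(x_0, \cdot)\|_s = \sup_{x \in B^d} \|L_s(x, \cdot)\|_s.
\]
Then it follows from $\|f_\rho\|_\infty \le M$ almost surely that
\[
\|\mathcal{L}_s f_\rho\|_s^2 = \Big\| \int_{B^d} L_s(x, \cdot) f_\rho(x)\,dx \Big\|_s^2 \le C M^2 \|L_s(x_0, \cdot)\|_s^2.
\]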


As $L_s(x_0, \cdot) \in \mathcal{P}_s^d$, for arbitrary $\{\xi_i\}_{i=1}^n \subset S^{d-1}$ with $n = \binom{s+d}{d}$, there holds
\[
L_s(x_0, x) = \sum_{i=1}^n c_i (1 + \xi_i \cdot x)^s.
\]
And $c = (c_1, \dots, c_n)^T$ satisfies
\[
c = A_s^{-1} L,
\]
where $A_s := \big((1 + \xi_i \cdot \xi_j)^s\big)_{i,j=1}^n$ and $L := (L_s(x_0, \xi_1), \dots, L_s(x_0, \xi_n))^T$. Furthermore, it follows
from Lemma 6 that


2=1 


s!r(d/2) -rj 


Then (fT2l) together with simple computation implies 


EM 2 <C 2 


s ^3 d 


2=1 


Therefore, 


\C s f p \\ s < CM 2 \\L s (x 0 ,-)\\ s = CM 2 


E c *(l + &') 


2=1 


SCA^^icllKi + ftO' 


2=1 


1/2 


< C' 1 2 s / 2 v / ^ El c *| 2 < C 2 2 s s 2d < C 3 (4d) s . 


Ki=1 


The above inequalities together with (15) and (14) yield
\[
E_{\rho^m}\big\{\|\pi_M f_{z,\lambda,s} - f_\rho\|_\rho^2\big\} \le C\left( \frac{s^d \log m}{m} + \big(E_{[s/2]}(f_\rho)_\infty\big)^2 + \lambda(4d)^{2s} \right).
\]


The proof of © is finished. 

To prove (7), noting that the middle inequality can be deduced directly from the definition
of $e_m(W^r)$ and the left inequality is proved in [13, Chapter 3], it suffices to prove the right
inequality. Since $f_\rho \in W^r$, the well known Jackson inequality [12] shows that
\[
\big(E_{[s/2]}(f_\rho)_\infty\big)^2 \le C s^{-2r}.
\]
Thus, letting $s = \lceil m^{1/(2r+d)} \rceil$, (7) holds for any $0 \le \lambda \le m^{-\frac{2r}{2r+d}} (4d)^{-2m^{1/(2r+d)}}$. This completes
the proof of Theorem 1.
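The choice of $s$ can be read as a balance between the three terms of the bound. The following worked step is a sketch that assumes the sample-error term is of order $\binom{s+d}{d}\log m / m$ (as the bound on $T_1$ suggests) and uses the elementary estimate $\binom{s+d}{d} \le (s+1)^d$; constants are not tracked carefully.

\[
s = \lceil m^{1/(2r+d)} \rceil \;\Longrightarrow\;
s^{-2r} \le m^{-\frac{2r}{2r+d}},
\qquad
\frac{(s+1)^d}{m} \le \frac{3^d\, m^{\frac{d}{2r+d}}}{m} = 3^d\, m^{-\frac{2r}{2r+d}},
\]
\[
0 \le \lambda \le m^{-\frac{2r}{2r+d}} (4d)^{-2m^{1/(2r+d)}}
\;\Longrightarrow\;
\lambda (4d)^{2s} \le (4d)^{2}\, m^{-\frac{2r}{2r+d}},
\]

where the second estimate uses $s + 1 \le m^{1/(2r+d)} + 2 \le 3 m^{1/(2r+d)}$ and the third uses $s \le m^{1/(2r+d)} + 1$. All three contributions are therefore at most a constant multiple of $m^{-2r/(2r+d)} \log m$, and making $\lambda$ smaller can only help.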
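Proof of Proposition 1: Let $\zeta := \{\zeta_i\}_{i=1}^n \subset B^d$. Suppose that there exists a non-trivial set
$\{a_i\}_{i=1}^n$ such that
\[
\sum_{j=1}^n a_j (1 + \zeta_j \cdot x)^s = 0, \quad x \in B^d.
\]
Then the system of equations
\[
\sum_{j=1}^n a_j (1 + \zeta_j \cdot \zeta_k)^s = 0, \quad k = 1, \dots, n
\]
is solvable. Noting that
\[
(1 + \zeta_i \cdot \zeta_j)^s = \sum_{k=0}^s \binom{s}{k} (\zeta_i \cdot \zeta_j)^k = \sum_{k=0}^s \binom{s}{k} \sum_{|\alpha| = k} C_\alpha^k\, \zeta_i^\alpha \zeta_j^\alpha,
\]
we obtain
\[
\sum_{i,j=1}^n a_i a_j (1 + \zeta_i \cdot \zeta_j)^s = \sum_{k=0}^s \binom{s}{k} \sum_{|\alpha| = k} C_\alpha^k \Big( \sum_{i=1}^n a_i \zeta_i^\alpha \Big)^2,
\]
where
\[
C_\alpha^k = \frac{k!}{\alpha_1! \cdots \alpha_d!}, \quad \alpha := (\alpha_1, \dots, \alpha_d).
\]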

Let 

n 

P(x) := y a;X Q . 

Z— 1 

If 



{C = (C»)?=t : dimK f , n < n}, 


then Q, i — 1 ,n are n distinct zero points of P. Noting that the degree of P is at most s, 
then it can be easily deduced from []4j Lemma 3.1] that the zero set of P, 

Z(p) := {x E B d : P(x ) = 0} 

has Lebesgue measure 0. This completes the proof of Proposition [T] 
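The linear system appearing in this proof can be checked exactly in the simplest case $d = 1$. The sketch below is an illustration under assumptions, not the authors' code: it takes $n = s + 1$ centers in $[0,1]$, builds the matrix $((1 + \zeta_i \zeta_j)^s)_{i,j}$ with rational arithmetic, and computes its determinant by Gaussian elimination. A non-zero determinant means the system has only the trivial solution, so the functions $(1 + \zeta_j x)^s$ are linearly independent; repeating a center makes the determinant vanish. The helper names are hypothetical.

```python
from fractions import Fraction


def kernel_matrix(centers, s):
    """Matrix ((1 + c_i * c_j) ** s) for one-dimensional centers."""
    return [[(1 + a * b) ** s for b in centers] for a in centers]


def determinant(matrix):
    """Exact determinant via Gaussian elimination on Fractions."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if a[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)          # singular matrix
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det                  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


s = 3
centers = [Fraction(k, s) for k in range(s + 1)]   # n = s + 1 distinct points
det_distinct = determinant(kernel_matrix(centers, s))
repeated = [centers[0], centers[1], centers[1], centers[3]]
det_repeated = determinant(kernel_matrix(repeated, s))
print(det_distinct, det_repeated)
```

In one dimension the matrix factors as $V D V^T$ with $V$ a Vandermonde matrix and $D$ the diagonal matrix of binomial coefficients, which explains why only coincident centers can make it singular here.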


To prove Proposition 2, we need the following two lemmas. The first one establishes a relation
between the $d$-dimensional unit ball $B^d$ and the unit sphere $S^d$ in $\mathbb{R}^{d+1}$, which can be
found in [38, Lemma 2.1].

Lemma 8: For any continuous function $f$ defined on $S^d$, there holds
\[
\int_{S^d} f(\xi)\,d\omega_d(\xi) = \int_{B^d} \Big[ f\big(x, \sqrt{1 - |x|^2}\big) + f\big(x, -\sqrt{1 - |x|^2}\big) \Big] \frac{dx}{\sqrt{1 - |x|^2}}.
\]

Let $h_\Lambda$ be the mesh norm of a set of points $\Lambda = \{\xi_i\}_{i=1}^m \subset S^d$ defined by
\[
h_\Lambda := \max_{\xi \in S^d} \min_{j} d(\xi, \xi_j),
\]
where $d(\xi, \xi')$ is the geodesic (great circle) distance between the points $\xi$ and $\xi'$ on $S^d$. The
second one is the well known cubature formula on the sphere, which can be found in ll2D . 

Lemma 9: If there exists a constant c such that h A < nr C//d , then there exists a set of numbers 
{ai}™ =l satisfying 

m 

J2\ a *\ P < Cm l ~ p . 

i =1 

such that 

/ P{y)dw{y) = jr a i P(x i ) for any P G U d n . 

Js i =i 

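Proof of Proposition 2: Based on Lemma 8 and Lemma 9, it suffices to prove that, for arbitrary
$\varepsilon > 0$, with confidence at least $1 - \frac{c}{m\varepsilon^d}$, there holds $h_\Lambda \le \varepsilon$. At first, we present an upper bound
of $h_\Lambda$. Let $D(\xi, r)$ be the spherical cap with center $\xi$ and radius $r$. Then for arbitrary $\varepsilon > 0$, due
to the definition of the mesh norm, we obtain
\[
P\{h_\Lambda > \varepsilon\} = P\Big\{ \max_{\xi \in S^d} \min_j d(\xi, \xi_j) > \varepsilon \Big\} \le E\big\{ (1 - \mu(D(\xi, \varepsilon)))^m \big\}.
\]
Let $t_1, \dots, t_N$ be quasi-uniform points [34] on the sphere. Then it is easy to deduce that
there exists a constant $c > 0$ such that
\[
N \le \frac{c}{\varepsilon^d} \quad \text{and} \quad S^d \subset \bigcup_{j=1}^N D(t_j, \varepsilon/2).
\]
If $\xi \in D(t_j, \varepsilon/2)$, then $D(t_j, \varepsilon/2) \subset D(\xi, \varepsilon)$. Therefore, we get
\[
\begin{aligned}
E\big\{ (1 - \mu(D(\xi, \varepsilon)))^m \big\}
&\le \sum_{j=1}^N \int_{D(t_j, \varepsilon/2)} (1 - \mu(D(\xi, \varepsilon)))^m\,d\mu \\
&\le \sum_{j=1}^N \int_{D(t_j, \varepsilon/2)} (1 - \mu(D(t_j, \varepsilon/2)))^m\,d\mu
= \sum_{j=1}^N \mu(D(t_j, \varepsilon/2)) (1 - \mu(D(t_j, \varepsilon/2)))^m \\
&\le \sum_{j=1}^N \max_u u(1 - u)^m \le \sum_{j=1}^N \max_u u e^{-mu} = \frac{N}{em} \le \frac{c}{m\varepsilon^d}.
\end{aligned}
\]
That is,
\[
P\{h_\Lambda > \varepsilon\} \le \frac{c}{m\varepsilon^d}.
\]
This finishes the proof of Proposition 2.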
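The step that bounds $\max_u u(1-u)^m$ in this proof rests on two elementary facts, spelled out here as a worked check for $u \in [0,1]$:

\[
1 - u \le e^{-u} \;\Longrightarrow\; u(1 - u)^m \le u e^{-mu},
\]
\[
\frac{d}{du}\big(u e^{-mu}\big) = (1 - mu)\,e^{-mu} = 0 \iff u = \frac{1}{m},
\qquad
\max_{u \ge 0} u e^{-mu} = \frac{1}{m}\,e^{-1} = \frac{1}{em}.
\]

Summing over the $N \le c/\varepsilon^d$ caps then gives $N/(em)$, which is at most a constant multiple of $1/(m\varepsilon^d)$; the factor $1/e$ is absorbed into the constant.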

Proof of Theorem 2: The proof of Theorem 2 is almost the same as that of Theorem 1.
Noting that $s \in [1, \lceil m^{1/d} \rceil]$ and $\lceil m^{1/(2r+d)} \rceil \in [1, \lceil m^{1/d} \rceil]$ for arbitrary $r > 0$, Theorem 2 can be
easily deduced from Theorem 1. For the sake of brevity, we omit the details.


VII. Conclusion 

The main contributions of the present paper can be summarized as follows. Firstly, we study
the parameter selection problem in polynomial kernel regression. Our analysis shows
that the essential role of the regularization term is to overcome the ill-conditioning
of the kernel matrix. Indeed, as far as model selection is concerned, an arbitrarily small
regularization parameter can yield the almost optimal learning rate. Secondly, we improve
the existing results about polynomial kernel regression in the following directions: building a
distribution-free theoretical analysis, extending the range of regression functions and establishing
the almost optimal learning rate. Thirdly, based on the aforementioned theoretical analysis, we
propose a new model concerning polynomial kernel regression and design an efficient learning
algorithm. Both theoretical and experimental results show that the new method is of high quality.